Machine Learning
Table of Contents
- 1. Paradigms
- 2. Perception Space and Concept Space.
- 3. World Model
- 4. Latent Space
- 5. Loss Function
- 6. Regression
- 7. Variational Inference
- 8. Gradient Descent/Ascent
- 9. Hopfield Network
- 10. Sparse Regression
- 11. Neural Network
- 12. Reinforcement Learning
- 12.1. Markov Decision Process
- 12.2. Generalized Policy Iteration
- 12.3. Temporal Difference Control
- 12.4. Function Approximation
- 12.5. Policy Gradient Methods
- 13. Interpretation
- 14. History
- 15. Products
- 16. Tools
- 17. Applications
- 18. Reference
1. Paradigms
1.1. Supervised Learning
- Trained using labeled data
1.2. Unsupervised Learning
- Unlabeled data
1.3. Reinforcement Learning
- Use rewards
2. Perception Space and Concept Space.
A layer of neurons is like classifying data with a bunch of lines (hyperplanes). Multiple layers correspond to cumulatively folding the paper.
3. World Model
A world model was given to chess engines in the 1970s. Then AlphaGo Zero built its own world model from its experience. An LLM's world model is general enough that it applies to individual cases.
The world model can be used to make the model dream. If the search space is too large, the model uses intuition given by a neural network and rolls out random possible sequences of moves.
The recent GPT-4 applies this rollout to thoughts within a tree of thoughts, to find the optimal response.
4. Latent Space
The embedding space.
4.1. Visualization
- Latent Space Visualisation: PCA, t-SNE, UMAP | Deep Learning Animated - YouTube
- Principal Component Analysis
- Fast but not extensible.
- t-SNE
- Maximize the probability that the low-dimensional representation occurs, where pairwise similarities are given by a Gaussian distribution around each point.
- UMAP
- Find the k nearest neighbors, and match the weights of the low-dimensional adjacency graph with those of the high-dimensional one.
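A minimal sketch of the first two methods, assuming scikit-learn and its built-in digits dataset; the dataset and hyperparameters are illustrative only (UMAP needs the separate umap-learn package and is omitted here).
```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)       # 64-dimensional digit images

# PCA: fast linear projection onto the top two principal components.
X_pca = PCA(n_components=2).fit_transform(X)

# t-SNE: matches pairwise similarities between the high- and low-dimensional
# representations; much slower and non-parametric.
X_tsne = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(X)

print(X_pca.shape, X_tsne.shape)          # (1797, 2) (1797, 2)
```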
5. Loss Function
5.1. Functions
5.1.1. Cross-Entropy Loss
- Calculate the cross entropy of the prediction with respect to the values that are expected.
5.1.1.1. Advantages
- Works well when the model outputs are probabilities
- Good for imbalanced classes
- Offers smoother gradients than other loss functions
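A minimal sketch of the cross-entropy computation for probabilistic class predictions; the helper name and the clipping constant are illustrative assumptions.
```python
import numpy as np

def cross_entropy(p_pred, y_true, eps=1e-12):
    """Cross-entropy of predicted class probabilities w.r.t. integer labels.

    p_pred : (n_samples, n_classes), rows summing to 1
    y_true : (n_samples,) integer class labels
    """
    p = np.clip(p_pred, eps, 1.0)                 # avoid log(0)
    return -np.mean(np.log(p[np.arange(len(y_true)), y_true]))

# Confident correct prediction -> small loss; confident wrong -> large loss.
print(cross_entropy(np.array([[0.9, 0.1]]), np.array([0])))  # ~0.105
print(cross_entropy(np.array([[0.1, 0.9]]), np.array([0])))  # ~2.303
```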
6. Regression
6.1. Linear Regression
7. Variational Inference
The goal is to approximate the probability distribution \( P(x) \). We postulate the existence of a latent factor \( z \) that affects the probability of \( x \), given as \( P(x\mid z) \). The latent factor also has a probability distribution \( P(z) \), factoring the distribution of \( x \): \[ P(x) = \int P(x\mid z)P(z)\,dz. \]
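A standard next step, not spelled out above: introduce an approximate distribution \( q(z) \) over the latent factor and apply Jensen's inequality, which gives the evidence lower bound (ELBO) that variational inference maximizes:
\begin{align*} \log P(x) &= \log \int P(x\mid z)P(z)\,dz = \log \mathrm{E}_{q(z)}\!\left[\frac{P(x\mid z)\,P(z)}{q(z)}\right]\\ &\ge \mathrm{E}_{q(z)}\left[\log P(x\mid z)\right] - \mathrm{KL}\!\left(q(z)\,\|\,P(z)\right). \end{align*}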
8. Gradient Descent/Ascent
8.1. Learning Rate
- The coefficient of the gradient.
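A minimal gradient-descent sketch; the objective function, starting point, and learning rate are illustrative assumptions.
```python
import numpy as np

# Minimize f(x, y) = (x - 3)^2 + (y + 1)^2 by following the negative gradient.
def grad_f(p):
    x, y = p
    return np.array([2 * (x - 3), 2 * (y + 1)])

p = np.zeros(2)
lr = 0.1                        # learning rate: coefficient of the gradient
for _ in range(200):
    p -= lr * grad_f(p)         # descent; use `+=` for gradient ascent

print(p)                        # close to [3, -1]
```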
9. Hopfield Network
- A Brain-Inspired Algorithm For Memory - YouTube
- Nodes with symmetric weights between each pair of them. The weight
signifies the correlation of two nodes.
- Model of the Hebbian theory.
- Once a network learns the appropriate weights, a stored pattern can be recovered
by sequential improvement:
- updating each node to the most probable value at each sweep. Convergence is guaranteed,
- given that the interference between stored patterns is negligible.
- Multiple patterns can be stored at the same time by averaging.
- \(n_\text{patterns} \approx 0.14\cdot N_\text{neurons}\)
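A minimal Hopfield-network sketch with Hebbian storage and asynchronous update sweeps; the pattern count, corruption level, and number of sweeps are illustrative.
```python
import numpy as np

rng = np.random.default_rng(0)
N = 100                                      # neurons
patterns = rng.choice([-1, 1], size=(5, N))  # 5 << 0.14 * N stored patterns

# Hebbian storage: averaged outer products, symmetric weights, zero diagonal.
W = (patterns.T @ patterns) / N
np.fill_diagonal(W, 0)

# Recall from a corrupted cue by asynchronous sweeps.
state = patterns[0].copy()
flip = rng.choice(N, size=15, replace=False)
state[flip] *= -1                            # corrupt 15 bits
for _ in range(10):                          # sweeps
    for i in rng.permutation(N):             # update one node at a time
        state[i] = 1 if W[i] @ state >= 0 else -1

print(np.mean(state == patterns[0]))         # typically 1.0 (pattern recovered)
```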
10. Sparse Regression
10.1. Loss Function
The goal is to minimize both the error of the approximation function and its complexity. This is achieved with a loss function that is a linear combination of two functions: \[ L = L_{\mathrm{MSE}} + \alpha L_{\mathrm{count}}. \]
The \( L_{\mathrm{count}} \) term can take various forms, something similar to \[ L_{\mathrm{count}} = \sum_i|w_i|^p. \] When \( p \) is close to zero, this approximates the count of nonzero coefficients (the \(\ell_0\) "norm"). But often \( p=1 \) is used, which still produces reasonable results and allows simpler calculation via convex optimization.
\( w_{i} \) are the coefficients of the predetermined feature functions that combine linearly to produce the approximation function.
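A minimal sketch of the \( p=1 \) case (LASSO), assuming scikit-learn; the synthetic features and the value of \(\alpha\) are illustrative.
```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))              # 10 candidate feature functions
w_true = np.array([3.0, 0, 0, -2.0, 0, 0, 0, 0, 0, 0])
y = X @ w_true + 0.1 * rng.normal(size=200)

# `alpha` plays the role of the trade-off coefficient in L = L_MSE + alpha * L_count.
model = Lasso(alpha=0.1).fit(X, y)
print(np.round(model.coef_, 2))             # most coefficients driven exactly to zero
```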
11. Neural Network
- A structure on which machine learning can be performed.
11.1. Rectifiers
11.1.1. ReLU
- Rectified Linear Unit
\[ \mathrm{ReLU}(x) := \max (0,x). \]
11.1.2. Leaky ReLU
\[ \mathrm{LReLU}(x) := \max(x, \varepsilon x) \] where \( 0 < \varepsilon \ll 1. \)
11.1.3. ELU
- Exponential Linear Unit
\[ \mathrm{ELU}(x) := \begin{cases} x & x >0, \\ \alpha(e^x -1) & x\le 0 \end{cases} \] where \( \alpha \ge 0 \).
11.2. Softmax
The softmax function \( \sigma\colon \mathbb{R}^k \to (0,1)^k \), where \( k\ge 1 \), is defined componentwise by: \[ \sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^k e^{z_j}}. \]
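A minimal sketch of the rectifiers above together with a numerically stable softmax; the shift by the maximum is a standard stability trick assumed here, not stated above.
```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, eps=0.01):
    return np.maximum(x, eps * x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def softmax(z):
    z = z - np.max(z)                 # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

z = np.array([-2.0, 0.0, 3.0])
print(relu(z), leaky_relu(z), elu(z))
print(softmax(z), softmax(z).sum())   # probabilities summing to 1
```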
11.3. Architecture
11.3.1. Multilayer Perceptron
- Researched in the '80s and '90s.
- But what is a neural network? | Deep learning chapter 1 - YouTube
- It is useful to have a rough concrete meaning of each layer (beforehand), to assess the network.
- Layer \(n\) Neurons
\[a_{j}^{(n)}=\sigma\left( b_{j}^{(n)} + \sum_{i}w_{ij}^{(n)}a_{i}^{(n-1)}\right)\]
- where \(\sigma: \mathbb{R} \to [0,1]\) is usually the sigmoid
function, aka logistic curve: \[
\sigma(x) = \frac{1}{1+e^{-x}}
\]
- It could be any other functions, such as ReLU.
- with matrix notation: \[ \mathbf{a}^{(n)} = \sigma\left(\mathbf{b}^{(n)}+\mathbf{W}^{(n)}\mathbf{a}^{(n-1)}\right) \]
- The entire neural network can be thought of as a function, as well as each neuron: \[ \mathbf{a}^{(-1)} = \mathbf{M}\left(\mathbf{a}^{(0)}\right) = \left(\dots\circ\mathbf{M}^{(2)}\circ\mathbf{M}^{(1)}\right)\left(\mathbf{a}^{(0)}\right) \] where \(\mathbf{a}^{(-1)}\) denotes the output layer.
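A minimal forward-pass sketch of the layer formula above; layer widths and random weights are illustrative.
```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlp_forward(a0, weights, biases):
    """Apply a^(n) = sigmoid(b^(n) + W^(n) a^(n-1)) layer by layer."""
    a = a0
    for W, b in zip(weights, biases):
        a = sigmoid(b + W @ a)
    return a

rng = np.random.default_rng(0)
sizes = [4, 8, 3]                                   # layer widths (illustrative)
Ws = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(m) for m in sizes[1:]]
print(mlp_forward(rng.normal(size=4), Ws, bs))      # 3 outputs in (0, 1)
```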
11.3.2. Recurrent Neural Network
- RNN
- Serial processing. Like for audio.
11.3.2.1. Long Short-Term Memory Network
- LSTM
- Deep layers of small recurrences.
11.3.3. Convolutional Neural Network
- Used for image processing
- Perform multiple 2D convolutions at each layer, to detect features.
11.3.4. Autoencoder
- Extract the core features and lift back to the original shape.
11.3.5. Transformer
- Self-Attention Layer: A dynamic layer that adapts its weights based on the context. It measures distances in the concept space.
- It is so named because it transforms the meaning of a sentence in the concept space.
- It is shallower but wider.
- Transformer Neural Networks Derived from Scratch - YouTube
- Calculate the attention between every pair of input vectors with an attention-head neural network, and use it as weights to reduce the value vectors down.
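A minimal single-head self-attention sketch; the separate query/key/value projections and the \(1/\sqrt{d}\) scaling are standard details assumed here, not stated above.
```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """One attention head: pairwise attention weights over the inputs,
    used to form a weighted combination of the value vectors."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # scaled pairwise similarities
    return softmax(scores, axis=-1) @ V       # weighted reduction of the values

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                  # 5 token embeddings of dimension 16
W = [0.1 * rng.normal(size=(16, 16)) for _ in range(3)]
print(self_attention(X, *W).shape)            # (5, 16)
```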
11.3.6. Auto-Regression
- Use the output as the input for the next inference.
11.3.7. Diffusion
When generating an image, the diffusion model performs Langevin sampling, specifically using the Euler-Maruyama method: \[ \mathbf{x}_{t+1} = \mathbf{x}_t + \frac{\epsilon}{2} F(\mathbf{x}_t) + \sqrt{2\epsilon} \mathbf{z}_t, \] in order to sample from the space of all images, where the score function \( F \) is approximated by the trained model \( M_{\theta} \).
The random noise \( \mathbf{z}_t \) serves two main purposes.
- Escaping the local maxima,
- Providing high frequency information.
The model only learns to remove high-frequency data, since it has to remove noise. Therefore a constant supply of high-frequency information is crucial for generating realistic images. Put another way, the noise alleviates the high correlation of contiguous pixels, producing a more realistic image.
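A minimal sketch of the Euler-Maruyama update above. The trained score model \( M_\theta \) is replaced by a hypothetical analytic score of a Gaussian centered at `mu`, so the loop produces a rough sample concentrated around that center; the step size and iteration count are illustrative.
```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([2.0, -1.0])
score = lambda x: -(x - mu)              # stand-in for F(x) = grad log p(x) of N(mu, I)

eps = 0.05
x = rng.normal(size=2)                   # start from pure noise
for t in range(1000):
    z = rng.normal(size=2)               # fresh high-frequency noise each step
    x = x + 0.5 * eps * score(x) + np.sqrt(2 * eps) * z

print(x)                                 # a rough sample concentrated around mu
```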
11.4. Learning
11.4.1. Quantification
11.4.1.1. Loss Function
- The wrongness of the model on a single data point
- \(\operatorname{square\,loss}(f(x_i|\theta),y_i) = (f(x_i|\theta)-y_i)^2\)
- For the case of vector output:
- \(\operatorname{square\,loss}(f(\mathbf{x}_{i}|\theta),\mathbf{y}_i) = |f(\mathbf{x}_i|\theta)-\mathbf{y}_i|^2\)
- See loss function.
11.4.1.2. Cost Function
- Error Function
- The wrongness of the model with the entire dataset.
- It can dramatically affect the resulting parameters of the model.
- Input: Weights and Biases, Parameters: Dataset, Output: A
Number.
- For a model, Input: Data point, Parameters: Weights&Biases, Output: Estimation.
- e.g. Mean Square Error: \[ MSE(\theta) = \operatorname*{avg}_i\left(\operatorname{square\,loss}(f(x_i|\theta),y_i)\right) \]
11.4.1.3. Objective Function
- A function that is to be optimized for a model.
- Cost function is one of them.
11.4.2. Gradient Descent
- Move in the direction of \(-\nabla C(\theta)\), repeatedly.
- In order to calculate the gradient, the activation must be continuous and differentiable.
11.4.2.1. Stochastic Gradient Descent
- Calculating the cost on the entire dataset at every descent step is not practical, so we divide the data into batches and calculate on one batch each step.
- It is not the optimal descent direction, but it speeds up the computation.
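A minimal mini-batch SGD sketch on linear regression with squared loss; the dataset, batch size, and learning rate are illustrative assumptions.
```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=1000)

theta = np.zeros(3)
lr, batch = 0.1, 32
for epoch in range(20):
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch):
        b = idx[start:start + batch]                  # one mini-batch
        grad = 2 * X[b].T @ (X[b] @ theta - y[b]) / len(b)
        theta -= lr * grad                            # one (noisy) descent step

print(np.round(theta, 2))                             # close to theta_true
```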
11.4.3. Backpropagation
- Method of simplifying calculation by calculating layer by layer, starting from the last layer.
- \[ \frac{\partial C_i}{\partial a_{k}^{(L-1)}} = \sum_j \frac{\partial C_i}{\partial a_j^{(L)}}\frac{\partial a_j^{(L)}}{\partial a_k^{(L-1)}} \]
- Note that the gradient is readily calculated at each layer, since you already have all the values that the gradients are evaluated at.
- This is an application of ((65747a84-db71-4b38-aaf9-a357720b2aa9)), which enables computing multiple partial derivatives at the same time.
- e.g. \[ \frac{\partial C_i}{\partial a_{k}^{(L-1)}} = \sum_j \left[2(a_j^{(L)}-y_j)\right]\left[w_{jk}^{(L)}\sigma'\!\!\left( z^{(L)}_j\right)\right] \]
- where \(z^{(L)}_j = b_{j}^{(L)} + \sum_k w_{jk}^{(L)}a_{k}^{(L-1)}\).
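A minimal backpropagation sketch for one hidden layer with sigmoid activations and squared loss, following the formulas above; the layer sizes and random data are illustrative.
```python
import numpy as np

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
a0, y = rng.normal(size=4), rng.random(3)
W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)
W2, b2 = rng.normal(size=(3, 5)), np.zeros(3)

# Forward pass: keep z and a of every layer for reuse in the backward pass.
z1 = b1 + W1 @ a0; a1 = sigmoid(z1)
z2 = b2 + W2 @ a1; a2 = sigmoid(z2)

# Backward pass, last layer first.
dC_da2 = 2 * (a2 - y)                         # dC/da^(L)
delta2 = dC_da2 * a2 * (1 - a2)               # dC/dz^(L); sigma'(z) = a(1 - a)
dC_dW2 = np.outer(delta2, a1)                 # gradients for W^(L) (and delta2 for b^(L))
dC_da1 = W2.T @ delta2                        # chain-rule sum over j
delta1 = dC_da1 * a1 * (1 - a1)
dC_dW1 = np.outer(delta1, a0)

print(dC_dW2.shape, dC_dW1.shape)             # (3, 5) (5, 4)
```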
11.5. History
- The 35 Year History of LLMs - YouTube
- Ilya - OpenAI
- RNN trained on Amazon reviews.
- Sentiment Neuron was found
- GPT-1, GPT-2
- Transformer trained on various books and the data from the web.
- GPT-3
- A hundred times bigger than the previous version, with 175 billion parameters and 96 layers, and a context window of around a thousand words.
- In-context learning
- 'prompt is the program'
- Through reinforcement learning, InstructGPT was later commercialized as ChatGPT in November 2022.
- LLM became the kernel system for the other tasks.
- Language processing is related to consciousness?
- GPT-4, Gemini, Bard, Grok in 2023
11.6. See Also
12. Reinforcement Learning
12.1. Markov Decision Process
- Variant of Markov chain
12.1.1. Definition
12.1.1.1. Chain
- It is a discrete stochastic process in which an agent takes actions \(A_t\) based on the current state \(S_t\), to get to a new state \(S_{t+1}\) and receive a reward \(R_{t+1}\).
- The space of states is denoted as \(\mathcal{S}\), the spaces of
actions at a state \(s\) as \(\mathcal{A}(s)\), and the space of
rewards as \(\mathcal{R} \subset \mathbb{R}\).
- \(\mathcal{S}^+\) is used when it contains terminal states.
- Tabular
- The process is tabular if all states and actions can be listed.
12.1.1.2. Dynamics
- The dynamics of the agent-environment interaction is given by the probability distribution: \[ p(s', r |s, a) := \mathrm{P}[S_{t+1}=s', R_{t+1}=r\mid S_t = s, A_t = a]. \]
- Complete Knowledge
- If the dynamics are known, one is said to have complete knowledge.
- Markov Property
- The probability only depends on the current state and action.
12.1.1.3. Policy
- The policy is the probability of each action (or a function giving an action) at each state; it is what needs to be optimized: \[ \pi(a|s) \quad\text{or}\quad \pi(s) = a. \]
12.1.1.4. Return
- Return is the weighted sum of the future rewards: \[ G_t := \sum_{k=t+1}^\infty \gamma^{k-t-1}R_k \]
- where \(\gamma\) is often a value between 0 and 1.
- In the episodic case the process terminates, marking the end of an episode and of the summation, in contrast to the continuing case.
12.1.1.5. Trajectory
- Sequence, possibly infinite, of states, actions, rewards produced by the process.
- It can be expressed as random variables, \[ S_0, A_0, R_1, S_1, A_1, R_2,\dots, R_T,S_T \]
- or as values, \[ s_0^m, a_0^m, r_1^m, s_1^m, a_1^m,\dots, r_T^m, s_T^m. \]
12.1.1.6. State Value Function
- Expected return at state \(s\), that varies depending on the policy \(\pi\).
- \[ v_\pi(s) := \mathrm{E}_\pi[G_t\mid S_t = s]. \]
- It is typically impossible to find it directly.
- Terminal states have a value of zero by definition.
12.1.1.7. Action Value Function
- Expected return of a state-action pair \((s, a)\). It also depends on \(\pi\).
- \[ q_\pi(s,a) := \mathrm{E}_\pi[G_t\mid S_t = s, A_t = a] \]
12.1.2. Bellman Equations
For any policy \(\pi\),
\begin{align*} v_\pi(s) &= \sum_{a\in \mathcal{A}(s)}\pi(a|s)q_\pi(s,a)\\ q_\pi(s,a) &= \sum_{s'\in\mathcal{S}, r\in\mathcal{R}}p(s',r|s,a)[r+\gamma v_\pi(s')] \end{align*}
- This is due to the fact that \[ \mathrm{E}_\pi[G_{t+1}\mid s, a] = \mathrm{E}_\pi[v_\pi(S_{t+1})\mid s,a] \] by the law of total expectation.
Consequently,
\begin{align*} v_\pi(s) &= \sum_{a\in \mathcal{A}(s)}\pi(a | s)\sum_{s'\in \mathcal{S}, r\in \mathcal{R}} p(s', r |s, a)(r+\gamma v_\pi(s'))\\ q_\pi(s,a) &= \sum_{s'\in\mathcal{S}, r\in\mathcal{R}}p(s',r|s,a)\left[r+\gamma \sum_{a'\in \mathcal{A}(s')}\pi(a'|s')q_\pi(s',a')\right] \end{align*}
12.1.3. Optimization
- The goal is to find the optimal policy \(\pi_*\) which satisfies:
- \[ \forall \pi, \forall s \in \mathcal{S}, v_*(s) := v_{\pi_*}(s) \ge v_\pi(s), \]
- \[ \forall \pi, \forall s\in \mathcal{S}, \forall a \in \mathcal{A}(s), q_*(s,a) := q_{\pi_*}(s,a) \ge q_\pi(s,a). \]
- The existence of such policy is guaranteed.
12.1.3.1. Bellman Optimality
- \[ v_*(s) = \max_{a\in\mathcal{A}(s)} q_*(s, a). \]
- This must be the case, since otherwise it is not optimal.
- There could be multiple actions values that are equal, in that case there's multiple optimal policy.
12.2. Generalized Policy Iteration
- GPI
Every reinforcement learning can be seen as a generalized policy iteration
\[ \pi_0 \xrightarrow{E} v_{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} \dots \xrightarrow{I} \pi_* \xrightarrow{E} v_* \]
12.2.1. Policy Evaluation
12.2.1.1. Model-Based
- The approximate value of the state value function and action value function can be calculated by repeatedly calculating the value at each point \(V(s)\), \(Q(s,a)\) according to the Bellman equations.
- It is guaranteed to converge.
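A minimal iterative policy evaluation sketch on a made-up two-state MDP with a terminal state; the dynamics, rewards, discount, and policy are illustrative only.
```python
import numpy as np

# dynamics[s][a] = (next_state, reward); state 2 is terminal.
dynamics = {0: {0: (1, 0.0), 1: (2, 1.0)},
            1: {0: (0, 0.0), 1: (2, 2.0)}}
policy = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}   # equiprobable pi(a|s)
gamma = 0.9

V = np.zeros(3)                        # V[2] stays 0 (terminal state)
for _ in range(100):                   # repeated Bellman-equation sweeps
    for s in dynamics:
        total = 0.0
        for a, p in policy[s].items():
            s2, r = dynamics[s][a]
            total += p * (r + gamma * V[s2])
        V[s] = total

print(np.round(V, 3))                  # V for states 0 and 1; terminal stays 0
```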
12.2.1.2. Value Iteration
- The value does not need to reach convergence before the policy is improved.
12.2.2. Model-Free
- Since the model is unknown, estimating the action value is more feasible.
12.2.2.1. Monte Carlo Evaluation
- See Monte Carlo method.
- Estimate the state value function of the augmented state \(s := (\tilde{s}, \tilde{a})\), a state-action pair, within the Markov reward process: \(V(s) = Q(\tilde{s},\tilde{a})\).
- Given \(M\) trajectory samples of an MRP, it can be
approximated by the average: \[
\mathrm{E}_\pi[G_t\mid S_t = s] \approx V(s) = Q(\tilde{s},\tilde{a}) := \frac{1}{C(s)}\sum_{m=1}^M\sum_{\tau = 0}^{T_m-1}\mathbf{1}[s_\tau^m = s]g_\tau^m
\]
- where \(C(s)\) is the counting function of the occurrences of \(s\)
- It can be divided into updates after each trajectory
following the update rule: \[
Q(\tilde{s}_t^m, \tilde{a}_t^m) = V(s_t^m) \leftarrow V(s_t^m) + \frac{1}{C(s_t^m)}(g_t^m - V(s_t^m))
\]
- here \(C(s_t^m)\) only counts up to the \(m\)th trajectory, up to time \(t\) within the last trajectory.
- \(g_t^m\) for each \(s_t^m\) are determined after the \(m\)th trajectory terminates. Then, the states within that trajectory would be updated.
- It correctly calculates the average at each iteration.
- \[ \frac{(C(s_t^m)-1)V(s_t^m) + g_t^m}{C(s_t^m)} \]
- Target is what the estimates are moved towards at each update.
12.2.2.2. Constant-α Monte Carlo
- The update rule is modified to be \[ V(s_t^m) \leftarrow V(s_t^m) + \alpha (g_t^m - V(s_t^m)) \]
- where the stepsize \(\alpha\in (0,1]\) is a constant.
- If \(\alpha\) is small, it converges slowly, if \(\alpha\) is large, the estimate is noisy. The best hyperparameter \(\alpha\) is problem dependent.
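A minimal sketch of the constant-α update applied after one finished episode; the function name and data layout are illustrative assumptions.
```python
def constant_alpha_mc_update(Q, episode_returns, alpha=0.1):
    """episode_returns: list of ((state, action), g) pairs from one finished episode."""
    for (s, a), g in episode_returns:
        old = Q.get((s, a), 0.0)
        Q[(s, a)] = old + alpha * (g - old)   # move the estimate toward the target g
    return Q

Q = {}
Q = constant_alpha_mc_update(Q, [(("s0", "a1"), 1.0), (("s1", "a0"), 2.0)])
print(Q)   # each estimate moved a step of size alpha toward its observed return
```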
12.2.2.3. Off-Policy Method
In on-policy methods, the behavior policy \(b\) is equal to the target policy \(\pi\): \(b(a|s) = \pi(a|s)\). But in off-policy methods \(b\neq\pi\), so importance sampling is required: \[ q_\pi(s,a) = \mathrm{E}_\pi[G_t\mid S_t = s, A_t = a] = \mathrm{E}_b\left[\rho G_t\mid S_t = s, A_t = a\right] \] where \(\rho\) is the product of the probability ratios over the remainder of the trajectory: \[ \rho := \frac{p_\pi(G_t)}{p_b(G_t)} = \prod_{\tau = t+1}^{T-1}\frac{\pi(A_\tau \mid S_\tau)}{b(A_\tau\mid S_\tau)}. \]
- The plain \(g_{t,b}^m\) is multiplied by \(\rho_t^m\) to obtain \(g_t^m\) at each update phase.
- Coverage is required \[
\pi(a|s) > 0 \implies b(a|s) > 0
\]
- Behavior policy must be able to take all the actions that target policy takes.
- This ensures that there's always data in the limiting case.
- It explores more, but has higher variance.
12.2.2.4. n-Step Temporal Difference Learning
- n-Step TD
If an episode takes a significant amount of time, or even infinite time, the process of learning is slow. So one needs to learn during an episode.
12.2.2.4.1. Target
The target \(g_t^m\) is modified to: \[ g_{t;t+n}^m = \sum_{\tau = t+1}^{t+n < T}\left(\gamma^{\tau-t-1}r_\tau^m\right) + \gamma^n V(s_{t+n}^m). \] If \(t+n \ge T\), it is \(g_{t;t+n}^m = g_t^m\).
This allows one to compute the return with only the knowledge of \(n\) steps forward into the future. Using a value estimate in the target is called bootstrapping.
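A minimal sketch of the n-step target above; the list layout (`rewards[tau]` holding \(r_\tau\)) and the example values are illustrative conventions.
```python
def n_step_target(rewards, t, n, V, states, gamma, T):
    """n-step TD target g_{t:t+n}; falls back to the full return when t + n >= T."""
    horizon = min(t + n, T)
    g = sum(gamma ** (tau - t - 1) * rewards[tau] for tau in range(t + 1, horizon + 1))
    if t + n < T:
        g += gamma ** n * V[states[t + n]]     # bootstrap with the value estimate
    return g

rewards = [None, 0.0, 0.0, 1.0]                # r_1, r_2, r_3 with T = 3
states = ["s0", "s1", "s2", "s3"]
V = {"s2": 0.5}
print(n_step_target(rewards, t=0, n=2, V=V, states=states, gamma=0.9, T=3))  # 0.405
```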
12.2.2.4.2. Hyperparameter
- \(n\) is a hyperparameter
- when \(n=\infty\) TD is identical to MC.
- \(\alpha\) can be larger compared to MC, when \(n\) is small. It converges faster without much noise.
- This is because the variance is reduced by bootstrapping, therefore larger steps can be taken.
- For small \(n\), typically slightly higher \(\alpha\) performs better.
- For larger \(n\), one learns more from a single episode,
since signals from farther away can be considered when
\(n\) is large.
- So the exploration of the state space is fast, yet noisier.
12.2.2.4.3. Properties
- Batch training is performing the training on a fixed batch of collected data, unlike training on real-time data.
- MC and TD arrive at different estimates under batch
training, MC finds the one that minimize mean-squared error,
which is achieved by average, and TD finds the one with the
maximum-likelihood of the MRP \(\hat{p}(s', r|s)\).
- Using \(\hat{p}(s',r|s) = C(s',r|s)/C(s)\) to estimate \(v\) is equivalent to 1-step batch TD.
TD's criterion is better for predicting returns.
- Because it more directly uses the assumption of the MRP.
12.2.3. Policy Improvement
- The optimal deterministic policy \(\pi_*\): \[ \pi_*(s) = \operatorname*{argmax}_a q_*(s,a), \]
- Therefore, the approximation of the optimal policy is chosen greedily: \[ \pi'(s) = \operatorname*{argmax}_a q_\pi(s,a). \]
12.2.3.1. Policy Improvement Theorem
- \[ \forall s\in \mathcal{S}, v_\pi(s) \le v_{\pi'}(s) \]
- If \(\pi\) is greedy with respect to its own value function \(v_\pi\), then \(\pi = \pi_*\).
12.2.3.2. Monte Carlo Control
- By the constant-α Monte Carlo the Q-table is obtained. The policy can be improved greedily with respect to the Q-table: \[ \pi'(s) = \operatorname*{argmax}_aQ(s,a). \]
12.2.3.2.1. Exploration-Exploitation Trade-Off
- To discover optimal policies, all state-action pairs must be explored, but to get high returns, one must exploit known high value pairs.
- If the policy does not explore the unexplored state, the policy might never be able to discover the state with high state value.
- Therefore, the policy needs to be soft: \[
\forall s\in \mathcal{S}, \forall a\in \mathcal{A}(s), \pi(a|s) > 0.
\]
- with infinite data, \(\pi_*\) is always discoverable, if the policies are chosen to be soft.
12.2.3.2.2. ε-Greedy Policy
- With probability \(\epsilon\), take an action selected uniformly from \(\mathcal{A}(s)\); otherwise select the action with the maximum \(Q\) value.
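A minimal ε-greedy selection sketch; the Q-table layout and the state/action names are illustrative.
```python
import numpy as np

def epsilon_greedy(Q, state, actions, epsilon, rng):
    """With probability epsilon act uniformly at random, otherwise take argmax Q."""
    if rng.random() < epsilon:
        return rng.choice(actions)                            # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0)) # exploit

rng = np.random.default_rng(0)
Q = {("s0", "left"): 0.2, ("s0", "right"): 0.7}
picks = [epsilon_greedy(Q, "s0", ["left", "right"], 0.1, rng) for _ in range(1000)]
print(picks.count("right") / 1000)   # roughly 0.95 = (1 - 0.1) + 0.1 / 2
```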
12.3. Temporal Difference Control
- As long as each process does its job partially, the algorithm will converge toward the optimum.
12.3.1. n-Step Sarsa
- On Policy TD Control
- Temporal difference learning is used, and the updates are applied during the episodes.
- "Sarsa" comes from 1-step Sarsa, in which every update operates on \(s_t^m, a_t^m, r_{t+1}^m, s_{t+1}^m, a_{t+1}^m\).
12.3.2. Q-Learning
- n-Step Q-Learning
12.3.2.1. Target
- The target is adjusted to:
- \[
g_{t;t+n}^m = \sum_{\tau = t+1}^{t+n < T}\left(\gamma^{\tau-t-1}r_\tau^m\right) + \gamma^n \max_aQ(s_{t+n}^m, a).
\]
- If \(t+n \ge T\), it is \(g_{t;t+n}^m = g_t^m\).
- Assume the best case scenario at the last state.
- The max operator means this is off-policy, since the return is not according to the policy.
- Under the behavior policy, \(Q\) is targeting \(q_*\). It is
following the policy \(\pi_\text{greedy}\) defined to choose the
maximum action value.
- In contrast to the on-policy methods, in which we are approximating the optimal policy with the previous best behavior policy.
- The updates can happen before the next action is taken, unlike Sarsa (on-policy TD), in which updates happen after the next action is chosen.
12.3.2.2. Properties
- With the maximum operator the creeping in of nearby low values is prevented, as presented in the example of cliff walking.
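A minimal 1-step Q-learning sketch. The toy chain environment, ε-greedy behavior policy, and all hyperparameters are illustrative assumptions, not from the note.
```python
import numpy as np

class ChainEnv:
    """Walk on positions 0..3; reaching 3 ends the episode with reward 1."""
    def reset(self):
        self.pos = 0
        return self.pos
    def step(self, action):                  # action: -1 (left) or +1 (right)
        self.pos = max(0, self.pos + action)
        done = self.pos == 3
        return self.pos, (1.0 if done else 0.0), done

def greedy(Q, s, actions):
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

env, Q, actions = ChainEnv(), {}, [-1, +1]
rng, alpha, gamma, eps = np.random.default_rng(0), 0.1, 0.9, 0.1
for episode in range(200):
    s, done = env.reset(), False
    while not done:
        a = rng.choice(actions) if rng.random() < eps else greedy(Q, s, actions)
        s2, r, done = env.step(a)
        best_next = 0.0 if done else max(Q.get((s2, b), 0.0) for b in actions)
        target = r + gamma * best_next                  # off-policy max target
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
        s = s2

print({k: round(v, 2) for k, v in Q.items()})           # 'right' (+1) valued higher
```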
12.3.3. Expected Sarsa
12.3.3.1. Target
- The target is adjusted to:
- \[
g_{t;t+n}^m = \sum_{\tau = t+1}^{t+n < T}\left(\gamma^{\tau-t-1}r_\tau^m\right) + \gamma^n \sum_a \pi(a|s_{t+n}^m)Q(s_{t+n}^m, a).
\]
- If \(t+n \ge T\), it is \(g_{t;t+n}^m = g_t^m\).
- Expected return at the last state.
- It is less noisy, but the computation is more expensive.
- This is on-policy method, if \(b = \pi\).
- Expected Sarsa is a more efficient version of Sarsa: Sarsa estimates \(Q\) only by sampling, while Expected Sarsa also uses the expected value.
12.4. Function Approximation
12.4.1. Generalization
If an agent only visits a minuscule subset of all states, how should it generalize that experience to select high reward actions over the remaining states?
- Supervised Learning
- Learn from noisy samples.
- SL applied for RL is called function approximation.
12.4.2. On Policy Evaluation
- When \(\mathcal{S}\) is large, the degree of freedom is reduced by assuming: \[ v_\pi(s) \approx \hat{v}(s, \mathbf{w}) \]
- where \(\mathbf{w}\in \mathbb{R}^d\).
- Since a state \(s\) is arbitrary, there must exist features \(\mathbf{x} : \mathcal{S}\to \mathbb{R}^n\) that describe the state.
12.4.2.1. Value Error
- \[ \overline{\mathrm{VE}}_\pi(\mathbf{w}) = \sum_{s\in\mathcal{S}}\mu_\pi(s)\left[v_\pi(s) - \hat{v}(s,\mathbf{w})\right]^2 \] where \(\mu_\pi\) is a distribution over states according to the policy \(\pi\).
12.4.2.2. Stochastic Gradient Descent
- The goal is to find \(\hat{v}\) that approximately optimizes the value error.
12.4.2.2.1. Assume
- On policy: States are visited in proportion to \(\mu_\pi\).
- \(\hat{v}\) is differentiable with respect to \(\mathbf{w}\)
- The observation \(U_t\) is a surrogate for the target \(v_\pi(S_t)\)
12.4.2.2.2. Update Rule
- \[ \mathbf{w} \leftarrow \mathbf{w} + \alpha[U_t - \hat{v}(S_t, \mathbf{w})]\nabla\big|^\mathbf{w} \hat{v}(S_t, \mathbf{w}) \]
- If the target \(U_t\) is unbiased, then \(\mathbf{w}\) will converge to a local optimum of value error.
12.4.2.3. Gradient Monte Carlo
- MC with modified update rule.
12.4.2.3.1. Target
- \(U_t = G_t\)
12.4.2.3.2. Properties
- It may be significantly slow, since it requires an episode to end before it updates.
- If the value estimator function is a linear value function of
the form \[
\hat{v}(s,\mathbf{w}) = \mathbf{w}^\top\mathbf{x}(s)
\] where \(\mathbf{x}(s)\) is a feature vector, the local
minimum \(\mathbf{w}\) is the global minimum \(\mathbf{w}\).
- Since the linearity makes the optimization convex.
12.4.2.4. Semi-Gradient Temporal Difference
- Semi-Gradient TD
12.4.2.4.1. Target
- \[ U_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, \mathbf{w}) \]
- It's biased. Since \[ \mathrm{E}[\hat{v}(S_{t+1}, \mathbf{w})|S_t = s] \neq \mathrm{E}[v_\pi(S_{t+1}) | S_t = s] \implies \mathrm{E}[U_t| S_t=s] \neq v_\pi(s) \]
12.4.2.4.2. Properties
- Update rule is not a true gradient step, because \(U_t\)
depends on \(\hat{v}\) which depends on varying
\(\mathbf{w}\).
- \[ \nabla (\overline{\mathrm{VE}}_\pi(\mathbf{w}, U_t)) \neq (\nabla \overline{\mathrm{VE}}_\pi)(\mathbf{w}, U_t) \] therefore following the gradient does not maximally minimize the value error.
- And \(\mathrm{E}[\hat{v}(S_t, \mathbf{w})] \neq \mathrm{E}[v_\pi(S_t)]\) either, even for the VE optimum \(\hat{v}\).
- That means we don't get guaranteed convergence.
- In 1-step case, for the linear value function, \(\mathbf{w}\)
will converge to a TD fixed point \(\mathbf{w}_\text{TD}\).
- The closeness to the VE optimum is related to the discount factor \(\gamma\). For high \(\gamma\) near 1, the solution can be far from the true optimum.
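A minimal semi-gradient TD(0) sketch with a linear value function \(\hat{v}(s,\mathbf{w}) = \mathbf{w}^\top\mathbf{x}(s)\); the two-state episodes, one-hot feature map, and step size are illustrative assumptions.
```python
import numpy as np

def td0_linear(episodes, x, w, alpha=0.1, gamma=1.0):
    for episode in episodes:                      # episode: [(s, r, s_next, done), ...]
        for s, r, s_next, done in episode:
            v = w @ x(s)
            v_next = 0.0 if done else w @ x(s_next)
            target = r + gamma * v_next           # bootstrapped (biased) target U_t
            w += alpha * (target - v) * x(s)      # gradient of v_hat only: semi-gradient
    return w

x = lambda s: np.eye(2)[s]                        # one-hot features for states 0 and 1
episodes = [[(0, 0.0, 1, False), (1, 1.0, 1, True)]] * 50
print(td0_linear(episodes, x, np.zeros(2)))       # approaches [1.0, 1.0]
```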
12.4.3. On Policy Control with Function Approximation
- Assumption
- Action space \(\mathcal{A}(s)\) is not large.
- Episodic case; the continuing case is different.
- The state value function \(\hat{v}(s, \mathbf{w})\) is now action value function \(\hat{q}(s,a,\mathbf{w})\).
12.4.3.1. Semi-Gradient 1-Step Sarsa
12.4.3.1.1. Target
- \[ U_t = R_{t+1} + \gamma \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}) \]
- Normalized radial basis functions are used as the features for a state.
12.4.4. Off Policy Methods with Function Approximation
12.4.4.1. The Deadly Triad
- Function Approximation
- Off-Policy Training
- Bootstrapping
- \(\leadsto\) stability and divergence issues.
- The combination is often unavoidable in practice.
12.5. Policy Gradient Methods
- Policy Gradient Methods | Reinforcement Learning Part 6 - YouTube
- More direct approach to the problem of finding the optimal policy.
- The goal is to directly determine the parameter \(\bm{\theta} \in \mathbb{R}^{d'}\) that optimizes the policy \(\pi(a|s, \bm{\theta})\).
12.5.1. Reinforce Algorithm
- Monte Carlo algorithm
- Update happens after a full episode is performed.
- Assume
- Policy \(\pi(a|s, \bm{\theta})\) is differentiable
- Initial setup
- Functional form of \(\pi\)
- Initial \(\bm{\theta}\)
- Step size \(\alpha\)
- The softmax function is often used so that the parameters \(\theta_i\) can remain unconstrained.
- The policy at a point is obtained by averaging the parameters weighted by normalized radial basis functions, and then applying the softmax.
- It is not exactly linear, but good enough.
- Update Rule
- \[ \bm{\theta} \leftarrow \bm{\theta} + \alpha g_t^m \nabla\ln\pi(a_t^m|s_t^m, \bm{\theta}) \]
- This reinforces the probability of taking action \(a_t^m\) when in state \(s_t^m\) in proportion to the return \(g_t^m\).
- \(\ln\) is there to compensate for the effect of selection frequency on the descent.
- A high selection rate is exactly canceled by dividing the step by the probability.
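A minimal REINFORCE update sketch with a tabular softmax policy (a simpler parameterization than the radial-basis one described above, chosen for illustration); the episode layout and step size are assumptions.
```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_update(theta, episode, alpha=0.01, gamma=1.0):
    """episode: list of (state, action, reward-received-after-the-action) tuples."""
    T = len(episode)
    for t, (s, a, _) in enumerate(episode):
        g = sum(gamma ** (k - t) * episode[k][2] for k in range(t, T))  # return g_t
        pi = softmax(theta[s])
        grad_log_pi = -pi                        # d/dtheta[s, :] of ln pi(a|s)
        grad_log_pi[a] += 1.0
        theta[s] += alpha * g * grad_log_pi      # reinforce a_t in proportion to g_t
    return theta

theta = np.zeros((2, 2))                         # 2 states x 2 actions
episode = [(0, 1, 0.0), (1, 0, 1.0)]
print(np.round(reinforce_update(theta, episode), 3))
```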
12.5.2. Reinforce with Baseline
- If \(\forall g_t^m \gg 1\), the optimization will take quite long,
because the Reinforce algorithm will never learn to avoid actions,
only to do good things more often.
- Replace \(g_t^m\) with \(g_t^m - b(s_t^m)\), where \(b\) is called the baseline.
- Baseline can be any function that does not depend on action and does not make the problem diverge.
- Natural choice of the baseline is the state value, the expected
return of that state:
- \[ b(s_t^m) = \hat{v}(s_t^m, \mathbf{w}) \]
- The agent would prefer the action that increases the expected return.
12.5.2.1. Update Rule
- It is exactly the combination of ((66865702-a5f4-4c57-a58b-0a182067473b)) and Reinforce algorithm.
12.5.3. Policy Gradient Theorem
- The objective to optimize in the policy gradient method is
- Episodic case: \[ v_{\pi_{\bm{\theta}}}(s_0) = \mathrm{E}_{\pi_{\bm{\theta}}}[G_0|S_0 = s_0] \]
- Theorem
- \[
\nabla\big|^{\bm{\theta}} v_{\pi_{\bm{\theta}}}(s_0) \propto \sum_{s\in \mathcal{S}}\mu_{\pi_{\bm{\theta}}}(s)\sum_{a\in \mathcal{A}(s)}q_{\pi_{\bm{\theta}}}(s, a)\nabla\big|^{\bm{\theta}}\pi(a|s, \bm{\theta})
\]
- The gradient of the state value of the initial state, i.e. the gradient of the total return is proportional to the action value weighted average of the gradient of policy at each state-action pair.
If an algorithm approximates this direction \(\nabla\big|^{\bm{\theta}}v_{\pi_{\bm{\theta}}}\), it'll approximately optimize the objective.
12.5.4. Properties
- PGM may be slow to converge because it requires smooth transitions to converge, but because of that it is more likely to eventually converge without getting stuck.
- It solves only for the policy.
- No detailed knowledge is gained.
- It bypasses the complexity.
- It can learn stochastic policies.
- The level of stochasticity can be learned!
- On the other hand, \(\epsilon\)-greedy method only has a fixed stochasticity.
- It avoids computationally intensive argmax operations.
12.5.5. Examples
- Proximal Policy Optimization
13. Interpretation
13.1. Mechanistic Interpretability
- The embedding of the input tokens are modified as it goes through attention layer and multilayer perceptron. The sequence of modified embeddings is called the residual stream.
- The final row that determines the next token can be unembedded at any point within the residual stream. Often the softmax is used, which limits the model's confidence in the next token. When the unembedded tokens change drastically in quality, it can be suspected that the layer in which the change occurred controls that quality in the next token.
- A single neuron exhibits polysemanticity: a concept is expressed by a combination of neurons, so the neural network can learn more concepts than it has neurons.
- The final row can be transformed into a sparse concept/feature vector with larger length, using a sparse autoencoder.
- Rare features are harder to extract using the autoencoder.
- Cross-layer superposition is under active research.
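A minimal sparse-autoencoder sketch: a linear encoder/decoder with ReLU features trained with a reconstruction-plus-L1 loss. The dimensions, random data, and training loop are illustrative assumptions, not the setup used in the cited work.
```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_dict, n = 32, 128, 2000            # dictionary wider than the residual stream
X = rng.normal(size=(n, d_model))             # stand-in for residual-stream activations

W_enc = 0.1 * rng.normal(size=(d_model, d_dict))
W_dec = 0.1 * rng.normal(size=(d_dict, d_model))
lam, lr = 1e-3, 1e-2

for step in range(300):
    f = np.maximum(0.0, X @ W_enc)            # sparse feature activations
    err = f @ W_dec - X                       # reconstruction error
    # loss = (1/n) * sum ||err||^2 + (lam/n) * sum |f|
    grad_dec = f.T @ (2 * err / n)
    grad_f = (2 * err / n) @ W_dec.T + (lam / n) * np.sign(f)
    W_enc -= lr * X.T @ (grad_f * (f > 0))    # backpropagate through the ReLU
    W_dec -= lr * grad_dec

print(np.mean((np.maximum(0.0, X @ W_enc) @ W_dec - X) ** 2))  # reconstruction MSE
```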
14. History
- AlphaGo beats Lee Sedol in 2016.
14.1. Moore's Law
- AI's Version of Moore's Law? - Computerphile - YouTube
The power of AIs is growing exponentially, measured by the time it takes experts to complete the tasks on which an AI achieves an 80% success rate.
15. Products
15.1. Image Generation
- Imagen - Google
- Flux.1 - Black Forest Labs
- Midjourney
15.2. Language Model
- GPT - OpenAI
- ChatGPT
- Grok - X
- Llama - Meta
- PaLM - Google
16. Tools
- Python Libraries: e.g., TensorFlow
- Regression produces a model for a continuous variable, and classification for a discrete variable.
17. Applications
- Nonlinear coordinate transformation using autoencoder
- A sparse approximation of the nonlinear dynamics, as a linear combination of candidate feature functions, can be found using SINDy
- SINDy uses the sparse regression loss function described above (see Sparse Regression)
18. Reference
- Rectifier (neural networks) - Wikipedia
- Softmax function - Wikipedia
- Machine learning - Wikipedia
- The Diffusion Model's Unsung Sidekick: A Science of Solving (Almost) Any Prob…
- Why Does Diffusion Work Better than Auto-Regression? - YouTube
- Stochastic gradient descent - Wikipedia
- Learning rate - Wikipedia
- How AI Learns Concepts - YouTube
- How Computers Derive Equations - YouTube
- Reinforcement Learning By the Book - YouTube
- [1] R. Sutton and A. Barto. Reinforcement learning: An Introduction (2nd Ed). MIT Press, 2018.
- [2] H. Hasselt, et al. RL Lecture Series, Deepmind and UCL, 2021, DeepMind x UCL RL Lecture Series - Introduction to Reinforcement Learning
- [3] J. Achiam. Spinning Up in Deep Reinforcement Learning, OpenAI, 2018
- [4] J. Achiam, Spinning Up in Deep Reinforcement Learning: Intro to Policy Optimization, OpenAI, 2018, Part 3: Intro to Policy Optimization — Spinning Up documentation
- [5] D. Silver, Lecture 7: Policy Gradient Methods, Deepmind, 2015, RL Course by David Silver - Lecture 7: Policy Gradient Methods - YouTube
- The Dark Matter of AI {Mechanistic Interpretability} - YouTube
- Deep Learning to Discover Coordinates for Dynamics: Autoencoders & Physics In…